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In this report we present a novel approach to model coupling for shared- 
memory multicore systems hosting OpenCL-compliant accelerators, which we 
call The Glasgow Model Coupling Framework (GMCF). We discuss the imple¬ 
mentation of a prototype of GMCF and its application to coupling the Weather 
Research and Forecasting Model and an OpenCL-accelerated version of the Large 
Eddy Simulator for Urban Flows (LES) developed at DPRI. 

The first stage of this work concerned the OpenCL port of the LES. The 
methodology used for the OpenCL port is a combination of automated analysis 
and code generation and rule-based manual parallelization. Eor the evaluation, 
the non-OpenCL LES code was compiled using gfortran, ifort and pgfortran, 
in each case with auto-parallelization and auto-vectorization. The OpenCL- 
accelerated version of the LES achieves a 7x speed-up on a NVIDIA GeEorce 
GTX 480 GPGPU, compared to the fastest possible compilation of the original 
code running on a 12-core Intel Xeon E5-2640. 

In the second stage of this work, we built the Glasgow Model Coupling Erame- 
work and successfully used it to couple an OpenMP-parallelized WRE instance 
with an OpenCL LES instance which runs the LES code on the GPGPI. The system 
requires only very minimal changes to the original code. The report discusses 
the rationale, aims, approach and implementation details of this work. 
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1 Introduction 


1.1 The Weather Research and Forecasting Model 

The Weather Research and Forecasting Modej^ (WRF) is a mesoscale numerical weather pre¬ 
diction system (NWS) intended both for forecasting and atmospheric research. It is an Open 
Source project, used by a large fraction of weather and climate scientists worldwide. The 
WRF code base is written in Fortran-90 and quite complex and extensive (about 1,000,000 
lines of code). 

1.2 The Large Eddy Simulator for the Study of Urban Boundary-layer Flows 

The Large Eddy Simulator for the Study of Urban Boundary-layer Flows (LES) is devel¬ 
oped by Hiromasa Nakayama and Haruyasu Nagai at the Japan Atomic Energy Agency 
and Prof. Tetsuya Takemi at the Disaster Prevention Research Institute of Kyoto Univer¬ 
sity [IZl im . It generates turbulent flows by using mesoscale meteorological simulations, and 
was designed to explicitly represent the urban surface geometry. Its purpose is to conduct 
building-resolving large-eddy simulations (LESs) of boundary-layer flows over urban areas 
under realistic meteorological conditions. WRF is used to compute the wind profile as input 
for LES, so there is a clear case for coupling both models. 

1.3 Model Coupling 

“Model coupling” means using data generated by one model as inputs for another model: 
e.g. climate simulations: atmosphere model, ocean model, land model, ice model. In com¬ 
bination with hardware Acceleration using GPU/manycore/EPGA, model coupling could 
result in much reduced run times and/or higher accuracy simulations. 

Model Goupling is of growing importance because models of e.g climate need to be as 
accurate as possible, and therefore include many factors and effects. Greating a single model 
incorporating all effects is increasingly difficult, and requires very large research teams to 
cover all specialisms. Gombining existing models allows us to model a large variety of very 
complex systems 

There are a number of existing libraries to support the process of model coupling, e.g. 
MGT, ESME, OASIS [l6l[3j|2]]. However, each of them requires hand modification of existing 
model codes, as well as writing of additional code to control the way the coupling is done 
needs to be written as well. Eurthermore, current Model Goupling libraries all use MPI. 

1.4 A Novel Approach to Model Coupling for MultIcore/GPU Nodes 

Greating coupled models using the current approach is very difficult and requires team of 
experts in various disciplines. As a result model coupling is too hard for most research 
teams. A better approach will allow more researchers to use model coupling and do better 
science. 


^http://www.wrf-model.org 
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Distributed system 


Inter-node distributed memory (MPI) 

- NxM coupling communication 
(orange) 

- Nearest-neigbour Grid 
communication (blue, green) 



Combined Shared/Distributed System 

- Coupling communication: Intra-node 
shared-memory (Pthreads) 

- Grid communication: Inter-node 
distributed memory (MPI) 


Figure 1: Current and proposed approaches to model coupling 


Furthermore, the current approaches were developed for clusters of single-core machines 
and are not ideal for multicore/GPU nodes. In particular, the current approach to load 
balancing requires that every model i 

gets a fixed number of nodes rrii, proportional to its run time, i.e. m; = aAt;. As a 
result, data from ntj nodes of model t need to be transferred to rrij nodes of model j. This is 
complex and asymmetrical, esp. for more than two models. 

Our proposed approach (Fig. is to limit coupling to processes running on a shared- 
memory node. The nodes run all processes required for coupling in threads, if a thread is 
not needed for coupling it goes to sleep and does not consume CPU time. As a result, the 
cluster communication is more symmetrical, and the load in each node can be balanced 
more easily. 

2 The OpenCL-Accelerated LES 

The DPRI Large Eddy Simulator (LES) is a high-resolution simulator which models flow over 
urban topologies. The main properties of the DPRI LES are: 

• Generates turbulent flows by using mesoscale meteorological simulations. 

• Explicitly represents the urban surface geometry. 

• Used to conduct building-resolving large-eddy simulations (LESs) of boundary-layer 
flows over urban areas under realistic meteorological conditions. 
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progrom main 


include ’common.sn' ! Porometer & variable setting, declaration 

call set 
call grid 

call timdota 
call init 
call ifdata 

c--main loop 

do 1000 n - n0,nmax 
time - floatCn-l)*dt 

c-^H--colculate turbulent flow-c 

call velnw 
call bondvl 
call velFG 

if(ifbf.eq.l) then 

call feedbf [calculate building effect 

end if 

call les 
call adorn 
call press 

c—HH--dato output -c 

call timseris 
coll Gveflow 
if(ianime.eq.l) then 
call anime 
endif 
c 

if(n.eq.nmax) then 
stop 
end if 
10M continue 
c 

end program 


Figure 2: LES main program (Fortran-77) 


• Essentially solves the Poisson equation for the pressure, using Successive Over-Relaxation 

• Written in Fortran-77, single-threaded, about a thousand lines of code. 


2.1 LES Code Structure - Functional 

The LES structure consists of sequential calls to follotving subroutines for each time step: 
velnw: Update velocity for current time step 

bondvl : Calculate boundary conditions (initial wind profile, inflow, outflow) 

velfg: Calculate the body force 

feedbf: Calculation of building effects 

les: Calculation of viscosity terms (Smagorinsky model) 

adam: Adams-Bashforth time integration 

press: Solving of Poisson equation (SOR) 
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Subroutine 

Block 

Parallelisation 

velnw 


data parallel over full domain 

bondvl 

in it uvw 

data parallel over full domain 


calc uout 

reduction over full domain 


calc uvw 

data parallel over boundary planes 

velfg 


data parallel over full domain 

feedbf 


data parallel over full domain 

les 

calc sm 

data parallel over full domain 


bound sm 

data parallel over boundary planes 


calc vise 

data parallel over full domain 

adam 


data parallel over full domain 

pres 

rhsav 

reduction over full domain 


sor 

reduction over full domain + iteration 


pav 

reduction over full domain 


adj 

data parallel over full domain 


boundp 

data parallel over boundary planes 


Table 1: LES Computational structure and parallelization strategy for each block 
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2.2 LES Code Structure - Computational 

2.3 Methodology 

In this section we discuss our novel methodology for porting Fortran NWP applications to 
OpenCL. 

The approach used in this work is as follows: 

1. Convert the original F77 code to F95, remove common blocks, if required refactor 
into subroutines. All this is done using our rf4a tool. The resulting F95 code has fully 
explicit subroutine arguments. 

2. For each subroutine that could become a kernel, we created a wrapper in two stages: 

a) We generate a wrapper from the original subroutine using the pragma 

! $ACC Kerne IWrapper(adain) 

call adaiii(fgh,fgh_old,iin,jin,kin) 

!$ACC End KernelWrapper 

This generates a template for a module module_adain_ocl which contains a sub¬ 
routine adaiii_ocl. We manually edit the generated code if required. 

• The template file (module_adain_ocl_TEMPL. f 95) is valid Fortran-95, if com¬ 
piled and run it will be functionally identical to the original subroutine. 

b) We generate the OpenCL API from the template using the pragma 

! $ACC Kernel (adaiii_kernel), GlobalRange(iin*jm*km), LocalRange (0) 
call adain(fgh,fgh_old, im, jiii,kiii) 

!$ACC End Kernel 

This actually uses even lower-level pragmas underneath, but these are normally not 
exposed. The code generator requires a Fortran subroutine for adain_kernel, which 
initially is a copy original routine adam, but can be edited if the OpenCL kernel argu¬ 
ments would be different. 

This generates inodule_adain_ocl. f 95 which contains all necessary calls to OpenCL 
using our OdWrapper library. 

• If we compile and run the code at this stage it will fail because the actual OpenCL 
kernel does not yet exist. 

3. At this stage we have a wrapper with OpenCL API calls and a “kernel” in Fortran, 
initially a copy of the original subroutine. We convert this kernel to C using our 
patched versiorj^ of F2C_ACC. Then we further convert this C code to OpenCL, with 
some cleaning up. At the moment, this stage is manual, because the conversion script 
is not yet ready. 


^https: / / github.eom/wimvanderbauwhede/RefactorF4Acc/tree/master/F2C-ACC 
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program mam 


finclude "module.and.variable.declarations.inc" 

I ---.. 


call setCdatal0,datall,data20,dato21,data22,data23,data24,data25,data26tdata27,data30,data31, & 
im,jm,km,ifbf,ianime,ical,n0,nl,nmax,dt,ro,vn,alpha,beta) 
call gridCdxl,dxl,dyl,dyl,z2,dzn,dzs,dxs,dys) 

call timdataO 

call init(km,jm,im,u,v,w,p,ifbf,cn2s,dxs,cn2l,cn3s,dys,cn3l,dzs,cn4s,cn4l,cnl,amaskl,bmaskl, & 
cmaskl,dmaskl,zbm,z2,dzn) 

call ifdataCical,data30,n,time,u,im,jm,km,v,w,p,usum,vsum,wsum,data31,fold,gold,hold,fghold, & 
ifbf,delxl,dxl,dyl,dzn,diul,diu2,diu3,diu4,diu5,diu6,diu7,diu8,diu9,sm,f,g,h,z2,dt,dxs,covl, & 
cov2,cov3,dful,vn,cov4,cov5,cov6,dfvl,cov7,cov8,cov9,dfwl,dzs,noul,nou5,nou9,nou2,nou3,nou4, & 
nou6,nou7,nou8,bmaskl,cmaskl,dmaskl,alpha,beta,fx,fy,fz,amaskl,zbm) 

! --main loop 

do n • n0,nmax 

time - floatCn-l)*dt 

j -calculate turbulent flow-c 

call velnw(km,jm,im,p,ro,dxs,u,dt,f,dys,v,g,dzs,w,h) 
call bondvl(jm,u,z2,dzn,v,w,km,ical,n,im,dt,dxs) 

call velfg(km,jm,im,dxl,covl,cov2,cov3,dful,diul,diu2,dyl,diu3,dzn,vn,f,cov4,cov5,cov6,dfvl, 
diu4,diu5,diu6,g,cov7,cov8,cov9,dfwl,diu7,diu8,diu9,dzs,h,noul,u,nou5,v,nou9,w,nou2,nou3, 
nou4 , nou6 , nou7 , nou8) 

ifCifbf -- 1) then 

call feedbfCkm,jm,im,usum,u,bmaskl,vsum,v,cmaskl,wsum,w,dmoskl,alpha,dt,beta,fx,fy,fz,f,g, & 
h) 

end if 

call lesCkm,delxl,dxl,dyl,dzn,jm,im,diul,diu2,diu3,diu4,diu5,diu6,diu7,diu8,diu9,sm,f ,g,h) 
call adam(n,nmax,data21,fold,im,jm,km,gold,hold,fghold,f,g,h) 

call press(km,jm,im,rhs,u,dxl,v,dyl,w,dzn,f,g,h,dt,cnl,cn2l,p,cn2s,cn3l,cn3s,cn4l,cn4s,n, & 
nmax,data20,usum,vsum,wsum) 

j -data output-c 

call timserisCn,dt,u,v,w) 

call aveflow(n,nl,km,jm,im,aveu,avev,avew,avep,avel,aveuu,avew,aveww,avesm,avesmsm,uwfx, & 
avesu,avesv,avesw,avesuu,avesvv,avesww,u,v,w,p,sm,nmax,uwfxs,datal0,time,datall) 
if(ianime — 1) then 

call animeCn,n0,nmax,km,jm,im,dxl,dxl,dyl,dyl,z2,data22,data23,u,w,v,amaskl) 
endif 

! 

if(n — nmax) then 
stop 
end if 
end do 

end program 


Figure 3: LES main program (Fortran-95) 
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• So at this point we have an actual fully working OpenCL version of the code, 
albeit a purely sequential one. 

4. In this work we manually parallelized the kernels. Ideally, we would like to use some 
compiler transformation to assist us with this, esp. loop merging. 

5. Finally, we merged all kernels into a super-kernel, and also merged the wrappers. In 
this work we did this manually but the process is straightforward to automate. 


2.4 Boundary Conditions Computation 

A very effective way to compute boundary conditions in OpenCL is the boundary range 
approach, where we create a global range consisting of the sum of the products of the 
domain dimensions for each direction: if the domain is ip x jp x kp then the boundary 

range is ip x jp + jp x kp + ip x kp. In this way, the local range can be set to auto. The 

kernel has an if-statement to calculate the different boundary conditions: 

if (gl_id < jp*kp ) { 

unsigned int k = gl_id / jp; 

unsigned int j = gl_id % jp; 

// do work 

y else if (gl_id < jp*kp + kp*ip) { 
unsigned int k = (gl_id - jp*kp) / ip; 

unsigned int i = (gl_id - jp*kp) % ip; 

// do work 

y else if (gl_id < jp*kp + kp*ip + jp*ip) { 

unsigned int j = (gl_id - jp*kp - kp*ip) / ip; 

unsigned int i = (gl_id - jp*kp - kp*ip) % ip; 

// do work 

> 

The global range is padded to a multiple of the product of the suggested number of threads 
and the number of compute unit^ as this results in much better load balancing: 

unsigned int m = nthreads*nunits; 

if (range % m) != 0) { 
range += m - (range % m) 

> 

Consequently, the last branch in the if statement in the kernel must also be guarded. 

2.5 Some Common Techniques 
2.5.1 Using OpenCL Vectors for Locality 

The LES uses separate arrays for values in x, y and z directions. We merged those arrays into 
float4 arrays, to improve locality and alignment. Thus, the arrays f,g,h which are actually 
the values of the force field f in the x, y and z direction, were merged into fgh. 

^based on CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_DEVICE_MAX_COMPUTE_UNITS 
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program main 


#include "module_and_variable_declarations.inc" 

! . 

I 

coll Setcdatal0,datall,data20,data21,data22,data23,data24,data25,data26,data27,data30,data31,& 
im,jm,km,ifbf,ianime,ical,n0,nl,nmax,dt,ro,vn,olpha,beta) 
coll gridCdxl , dxl , dyl , dyl , z2 , dzn,dzs , dxs , dys) 

coll timdatoC) 

coll initCkm,jm,im,u,v,w,p,ifbf,cn2s,dxs,cn21,cn3s,dys,cn31,dzs,cn4s,cn41,cnl,atiioskl,bmaskl, & 
coll ifdataCical,data30,n,time,u,in,jm,km,v,w,p,usum,vsum,wsum,data31,fold,gold,hold,fghold, & 
ifbf,delxl,dxl,dyl,dzn,diul,diu2,diu3,d\u4,d\u5,diu6,diu7,diu8,diu9,sm,f,g,h,z2,dt,dxs,covl, & 
cov2,cov3,dful,vn,cov4,cov5,covG,dfvl,cov7,cov8,cov9,dfwl,dzs,noul,nou5,nou9,nou2,nou3,nou4, & 
nou6 , nou7 , nou8 , bmoskl , cmoskl , dmoskl , olpho , beta , fx , fy , fz , amaskl , zbm) 

coll initialise_LES_kernelC & 

p,u,v,M,usum,vsum,wsum,f,g,h,fold,gold,hold, & 

dial, diu2, diu3, diu4, diuS, diu6, diu7, diu8, diu9, & 

amaskl, bmaskl, cmaskl, dmaskl, & 

cnl, cn21, cn2s, cn31, cn3s, cn41, cn4s, & 

rhs, sm, dxs, dys, dzs, dxl, dyl, dzn, z2, & 

dt, im, jm, km, nmax & 

) 

! —main loop 

do n - n0,nmax 

time » floatCn-l)*dt 

I -calculate turbulent flow-c 

call run_LES_kernel ( & 
data20, data21, & 
im, jm, km, & 
dt, dxl,dyl,dzn, & 
n , nmax & 

, p,u,v,w,f,g,h,fold,gold,hold & 

) 

I -data output -c 

call timserisCn,dt,u,v,w) 

call aveflo»(n,nl,km,jm,im,aveu,avev,avew,avep,avel,aveuu,avevv,aveww,avesm,avesmsm,uwfx, & 
avesu,avesv,avesw,avesuu,avesvv,avesww,u,v,w,p,sm,nmax,uwfxs,datal0,time,datall) 
call anime(n,n0,nmax,km,jm,im,dxl,dxl,dyl , dyl,z2,data22,data23,u,w,v,amaskl) 

! 

if(n — nmax) then 

stop 

end if 

end do 

end program 


Figure 4: LES main program with call to OpenCL kernel wrapper 


9 







2.5.2 Using Private Scaiars Instead of Giobai Arrays 

In several places, the LES computes on global arrays, e.g. the f/g/h arrays. Where possible, 
we replaced the global access to fgh(i,j,k) by a local scalar fgh ijk, in effect a form of manual 
caching. 

2.5.3 Pre-computing Neighboring Points 

The original computation of f/g/h (in the velfg routine) first computes cov and diu arrays 
for the full domain, and then accesses them at neighboring points to compute f/g/h: 


covxl = (dxl(i+l)*covl(i,j,k)+dxl(i)*covl(i+l,j,k)) /(dxl(i)+dxl(i+l)) 
covyl = (cov2(i,j,k)+cov2(i,j+l,k))/2. 
covzl = (cov3(i,j,k)+cov3(i,j,k+l))/2. 
cove = covxl+covyl+covzl 

dful(i,j,k) = 2.*(-diul(i,j,k)+diul(i+1,j,k))/(dxl(i)+dxl(i+1)) + (-diu2(i,j,k)+diu2( 
j+1,k))/dyl(j) + (-diu3(i,j,k)+diu3(i,j,k+l))/dzn(k) 
df = vn*dful(i,j,k) 
f(i,j,k) = (-covc+df) 

By pre-computing the values for the cov and diu arrays at points i-|-l/j-|-l/k-|-l, the loops 
can be merged: 


float covxl = (dxl[i+2]*cov_ijk.sO+dxl[i+1]*cov_ijk_pl.sO) /(dxl[i+1]+dxl[i+2]); 
float covyl = (cov_ijk.sl+cov_ijk_pl.sl)/2.OF; 
float covzl = (cov_ijk.s2+cov_ijk_pl.s2)/2.OF; 
float cove = covxl+covyl+covzl; 

float dful_ijk = 2.OF*(-diu_ijk.sO+diu_ijk_pl.sO)/(dxl[i+1]+dxl[i+2]) + (-diu_ijk.sl+i 
float df = vn*dful_ijk; 
fgh_ijk.sO = (-covc+df); 

2.6 SOR Algorithm Implementation 

The major bottleneck of the LES is the solver for the Poisson equation, which uses Successive 

Over-Relaxation. The original algoritm is implemented using the red-black scheme: 

do nrd =0,1 
do k = 1,km 
do j = l,jm 

do i = l+mod(k+j+nrd,2),im,2 
reltmp = omega*(cnl(i,j,k) *( & 
cn21(i)*p(i+l,j,k) + & 
cn2s(i)*p(i-l,j,k) + & 
cn31(j)*p(i,j+1,k) + & 
cn3s(j)*p(i,j-l,k) + & 
cn41(k)*p(i,j,k+l) + & 
cn4s(k)*p(i,j,k-l) - & 
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rhs(i,j,k))-p(i,j,k)) 
p(i,j,k) = p(i,j,k) +reltmp 
sor = sor+reltmp*reltmp 
end do 
end do 
end do 
end do 

Conceptually, the p array is divided in red and black points so that every red point has black 
nearest neighbors and vice-versa. The new values for p are computed in two iterations (the 
nrd-loop in the code example), one for the red, one for the black. 

While this is very effective for single-threaded code, and in fact also for parallel code 
on distributed memory systems, it suffers from poor locality because the accesses to p are 
strided, and the correction computation requires access to all six neighbors of p. As a 
result, the threads in each compute unit cannot perform coalesced reads or writes. InH^, 
Konstantinidis and Cotronis explore a GPU implementation of the SOR method and conclude 
that their proposed approach of reordering the matrix elements according to their color 
results in considerable performance improvement. However, their approach is not readily 
applicable to our problem because one the one hand we have a 3-D array which is much 
harder to reorder than a 2-D array (i.e. the cost of reordering is higher) and also, we cannot 
use the reordered array as-is, so we would incur the high reordering cost twice. 

Therefore, we developed a new technique which we call “twinned double buffering”: 
rather than using the read-black scheme for updating p, we use a “double buffer” update 
scheme for p: for nrd=0, p_l is updated by values from p_0, and for nrd=l, vice versa. If 
we would use two arrays for this, locality would still be poor, so instead we use an array of 
vectors of two floats, which we called a “twinned” array. Using this data structure and the 
double buffering scheme, and assigning the compute units to the k index and the threads 
in the compute unit to the i index of the array, we obtain coalesced memory access. The 
difference in performance is indeed huge: our approach is 6x faster than the parallelized 
version of the red-black scheme. 

2.7 Evaluation 

In this section we present a validation study of the OpenCL LES by comparing it to the orig¬ 
inal Fortran version. We ran the LES on a domain of 150x150x90 points, with 50 iterations 
for SOR (default setup). 

2.7.1 Compilers 

The compilers used for the comparison were gfortran 4.8.2 for OpenCL code, as well as 
pgf77 12.5-0 and ifort 12.0.0 for the reference code. We used the following optimizations 
for auto-vectorization and auto-parallelization: 

• gfortran -Ofast -floop-parallelize-all -ftree-parallelize-loops=24 

• pgf77 -03 -fast -Mvect=simd:256 
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#cores 

vector size 

Clock speed (GHz) 

CPI 

(-FLOPS) 

Memory BW 
(GB/s) 

Intel Xeon E5-2640 

24 

8 

2.5 

480 

42.6 

Nvidia GeForce GX480 

15 

32 

1.4 

672 

177.4 


Table 2: Hardware Performance Indicators 


LES OpenCL: Speed-up wrt orginal LES code 


Original code (F77) on CPU. ifort (*) 
Original code (F77) on CPU, pgf77 
Original code (F77) on CPU. gfortran 
Refactored code (F95) on CPU, ifort 
Refactored code (F95) on CPU, pgf95 
Refactored code (F95) on CPU, gfortran 
OpenCL on Multicore CPU 
OpenCL on GPU 



012345678 


Speed-up wrt. onginal code (*) 


Figure 5: Comparison of OpenCL LES with original Fortran code 


• ifort -03 -parallel 

The run time of the F77 and F95 code is the same with all compilers (to within a few %) 

2.7.2 Hardware platforms 

The host platform is an Intel Xeon E5-2640 0 @ 2.50GHz, dual-socket 6-core CPU with two- 
way hyperthreading (i.e. 24 threads), with AVX vector instruction support, 32GB memory, 
15MB cache, Intel OpenCL vl.2. The 

GPU platform is an NVIDIA GeForce GTX 480 @ 1.40 GHz, 15 compute units, 1.5GB 
memory, 250KB cache, NVIDIA OpenCL 1.1 (CUDA 6.5.12). 

Table shows the Hardware Performance Indicators for both systems. 


2.7.3 OpenCL LES Results 

The results of the comparison of the OpenCL code with the F77 and F95 reference code 
results are summarized in Figure [^. The OpenCL-LES running on the GPU is 7\times faster 
than the fastest reference runtime. 

We implemented all kernels of the LES in OpenCL, because this allows us to keep all data 
structures in the GPU memory, rather than copying between the host and the GPU. Figure]^ 
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LES: Proportion of time spend in each routine 


PRESS 

ADAM 

LES 

FEEDBF 

VELFG 

BONDVl 

VELNW 



0% 10% 20% 30% 40% 50% 60% 70% 80% 


Figure 6: Proportion of time spent in each routine 


shows the breakdown of the run time per subroutine in the original LES code. We see that 
the press routine, which contains the SOR, takes most of the run time, followed by the velfg 
and les routines. Figure 2.7.3 shows the corresponding times for the actual OpenCL kernels. 
It should be noted that 50 SOR iterations is actually on the low side to achieve convergence. 
So the SOR routine will dominate the run time entirely for more iterations. 


2.8 OpenCL LES Evaluation Conclusion 

The most important outcome of the effort to convert the LES to OpenCL is the development 
of an open-source toolchain to facilitate porting of Fortran-77 code to OpenCL. Using this 
toolchain considerable reduces the porting effort, as illustrated by our work. Furthermore, 
the porting of the LES specifically has led to the development of a novel parallel implemen¬ 
tation of the 3-D Successive Over-relaxation algorithm. The resulting performance is very 
good: the OpenCL LES running on GPU is seven times faster than the original code running 
on a multicore CPU. 


3 The Glasgow Model Coupling Framework 

In this section we introduce the Glasgow Model Coupling Framework. We discuss the the 
longer-term aims of the framework, and status of the current prototype. 
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OpenCL LES: Contributions of kernels to total run time 
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Figure 7: Proportion of time spent in each OpenCL kernel 

3.1 Longer-term aims 

Our longer-term aim is to create a system where there is no need for the end user to write 
any code at all to achieve model coupling. Instead, we envisage that our system will use 
a scenario, written in a formal way but in a natural language, to express how the models 
should be coupled. The scenario will serve as the input to a compilation and code generation 
system that will create the fully integrated model coupling system. 

Designing such a specification language is one of the major research tasks to achieve this 
goal. We plan to use a system based on Multi-Party Session Types []3l to ensure that the 
scenario results in a correct system that e.g. is deadlock-free and where data exchanges be¬ 
tween incompatible senders and receivers are impossible. The Scribbl^ protocol language 
[l4ll uses MPST and therefore constitutes a natural starting point for our investigation. 

However, in order to make the scenarios accessible to a large audience, we want to explore 
the use of natural language based specification, e.g. the Gherkin DSL as used in behavior- 
driven development [jl]] . 

Furthermore, apart from the scenario, there is also a need for a formal specification of 
each model participating in the scenario. One of the reasons why model coupling is currently 
very difficult is that it takes a lot of effort and skill to determine where e.g. the wind velocity 
is generated in a given model, what the time loop is, what the variables and their dimensions 
are, etc. Our aim is to design a model specification that allows a code analysis tool to work 
out all necessary information - as opposed to a specification that would provide all this 
information, as such as specification would be difficult and time consuming to produce. 


"^http : / / WWW. scribble. org/ 
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3.2 GMCF Model Coupling Operation 

Adopting the classification terminology used by Michalakes GMCF adopts a component- 
based coupling model with dataflow-style peer-to-peer interaction. There is no centralized 
control or scheduling. Instead, models communicate using a demand-driven mechanism, 
i.e. they send request to other models for data or control information. At each simulation 
time step, each model requests time stamps from the models it is coupled with, and waits 
until it has received the corresponding time steps. This synchronization step ensures that 
the timing of data exchanges is correct. After syncing, each model can request and/or send 
data either before or after computation. When the time loop of a model finishes, the model 
broadcast a message to all communicating models. 

3.3 Approach and Status for Current Prototype 

To demonstrate the main concepts of our approach to Model Coupling, we created a proto¬ 
type with the express aim to couple WRF and the DPRILES modelj^using a producer/consumer 
pattern. As we will see, the same prototype is readily useable for symmetrical data exchange 
as well. 

We created a modified version of our Glasgow Parallel Reduction Machine (GPRM) frame- 
worl^ GPRM is a framework for task-based parallel computing, and one can consider Model 
Coupling as a special case of this. We call our new framework the Glasgow Model Coupling 
Framework (GMCF][^ The necessary modifications are related two different aspects: 

• GPRM Is a C-l—I- framework, and most scientific models are written in Fortran. There¬ 
fore we developed an automated approach to integrating Fortran code into GPRM. 

• GPRM uses run-to-completion semantics, i.e. once a task is started, it runs to comple¬ 
tion and only then produces output and checks for new input data. In a long-running, 
time-step-base application such as a NWP model, this approach is not practical, as 
it would be a huge effort to lift the the time loop out of the existing model and into 
GPRM. Therefore we created a new API which allows communication between models 
from inside the model code, obviously essential for model coupling. 

Apart from those major changes, we also changed the code organization and the build 
system to simplify the creation of a coupled model using GMGF. 


3.4 More detail on the GMCF architecture 


3.4.1 Run-time architecture 


The run-time architecture of GMGF is illustrated in Fig. 3.4.1 On startup, the GMGF runtime 
system creates a fixed number of threads. Each of these threads contains a main loop which 
blocks on a EIEO. 


^http://www2.mmm.ucar.edu/wrf/users/workshops/WS2010/presentations/Tutorials/Coupling%20WRF%20Michalakes.pdf 
^https: / / github.com/wimvanderbauwhede/LES 
^https: / / github.com/wimvanderbauwhede/GannetCode 
®https: / / github.com/wimvanderbauwhede/gmcf 
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Figure 8: GMCF runtime architecture. 


Every model is allocated to a thread and interfaces with the other models through a set of 
FIFOs. The communication between the models is conceptually based on packets of different 
types. In other words, GMCF uses a message-passing approach. As a result, it is in principle 
possible to use GMCF for distributed-memory model coupling as well as shared-memory 
model coupling. When a packet is read from the main RX fifo it can be either consumed 
directly by the model or stored in a per-type fifo for later use. 

The system is demand-driven: there are request packets and response packets. The in¬ 
formation to exchange between the models is either time-related, data-related or control- 
related. 

The low-level GMCF API provides functionality to wait on the main FIFO, to separate 
packets out into per-type FIFOs, to shift packets from the FIFOs and push packets onto the 
FIFOs. The higher-level API hides the packet abstraction, instead it expresses synchroniza¬ 
tion and data transfer. 

3.4.2 Software architecture 

Each of the threads in the GMCE runtime instantiates an object of the Tile class and calls its 
runO method, which implements the wait loop. The Tile class contains the EIEOs and the 
Core class, which takes care of the actual model calls (Eigure [3.4.2| ). 

Our approach is to transform the Eortran top-level program unit into a subroutine which 
takes a pointer to the tile as argument (Eigurejl^. All model subroutines become methods 
of a singleton Core class which inherits from the base Core class which provides the API 
calls. This approach allows us to hide any state in the object. The Core class has a run() 
method which selects the model-specific method based on the thread in which it is called at 
the start of the run. 

Eurthermore, the Eortran API consists of a generic part and a per-model part, both im¬ 
plemented as modules. The generic part requires the tile pointer as an argument for each 
call. The per-model API stores this pointer in its module so that the final API calls require 
no additional arguments. 
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Figure 9: GMCF software architecture 


When the Core.runQ method calls the model subroutine, this subroutine can use the 
GMCF C++ API method calls to interact with the FIFOs and so with the other models in the 
system. 

3.4.3 Code generation architecture 

The main purpose of GMCF is to make model coupling easier. Therefore, we try to minimize 
the necessary changes to the original code, and in fact our long-term aim is to entirely 
automate the process. Currently, the user has to manually insert the actual model coupling 
API calls. 

Apart from that, the build system takes care of the entire integration. This is less obvious 
than might seem at first: the GMCF runtime and API are written in C++, so it is necessary to 
generate code to interface with the Fortran model code. The necessary information required 
from the user is very limited: the full path of the top-level program unit of each model. The 
build system analyses this program unit and adds the necessary code for GMCF integration, 
and generates all required interface code. 

4 Coupling WRF and LES 

In this section we present the implementation and preliminary results of coupling WRF and 
the OpenCL LES using the GMCF. 
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Figure 10: GMCF Fortran-C++ integration and code generation 
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Figure 11: Coupling WRF and the Large Eddy Simulator 


WRF is used to compute the wind profile as input for LES. In the original version of the 
LES, this input is manually extracted from WRE and hardcoded in the Eortran source. 

4.1 Implementation Details 

In order to achieve coupling, some modifications are required to both WRE and LES. On 
the one hand, the build system needs to be modified so that the model is compiled into a 
library rather than an executable. Eor LES, as we use SCons, this is trivial. WRE uses a 
complex build infrastructure based on make. The changes are still very small, and some of 
the changes are actually patches to bugs in the WRE build system. In fact, the modified WRE 
build system can now build WRF executables using both Fortran and C++ main programs. 
The source code of the main program unit needs to be modified to make it a subroutine, 
this is a very minor change. The other main changes are related to the actual coupling. Our 
approach is to create a per-model API based on the generic GMCF API. Eventually, this per- 
model API will be automatically generated. Then the API is used to achieve synchronization 
and data exchange. 

4.1.1 Creating a gmcf subroutine from the main program unit 

The changes to main/wrf . F and the LES main, f 95 required to be able to use GMCE are 
shown below; these changes are auto-generated by the build system, the user only needs to 
provide the name of the original program and of the new top-level subroutine. 


subroutine program_wrf_gmcf(gmcf_ptr, model_id) ! This replaces ’program wrf’ 
use gmcfAPI ! gmcf 

use gmcfAPIwrf ! gmcf 
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USE module_wrf_top, only : wrf_init, wrf_dfi, wrf_run, wrf_finalize 
IMPLICIT NONE 

integer(8) , intent(In) :: gmcf_ptr ! gmcf 
integer , intent(In) :: model_id ! gmcf 

call gmcfInitWrf(gmcf_ptr, model_id) ! gmcf 
! Set up WRF model. 

CALL wrf_init 

! Run digital filter initialization if requested. 

CALL wrf_dfi 

! WRF model time-stepping. Calls integrateO. 

CALL wrf_run 

! WRF model clean-up. This calls MPI_FINALIZE() for DM parallel runs. 
CALL wrf_finalize 
end subroutine program_wrf_gmcf 

And for the LES: 

subroutine program_les_gmcf(gmcf_ptr, model_id) 

! ... LES-specific use statements ... 
use gmcfAPI 
use gmcfAPIles 

integer(8) , intent(In) :: gmcf_ptr 
integer , intent(In) :: model_id 
! ... LES-specific declarations ... 

call gmcfInitLes(gmcf_ptr, model_id) 

! ... LES-specific code 

end subroutine program_les_gmcf 

4.1.2 Time synchronization and data exchange 

The actual synchronization and data exchanges also requires very few changes, currently 
these have to be done manually. In WRF, these are in f rame/module_integrate . F: 

MODULE module_integrate 
CONTAINS 

RECURSIVE SUBROUTINE integrate ( grid ) 
use gmcfAPI 
use gmcfAPIwrf 
! ... WRF-specific code ... 

! WRF main time loop 

DO WHILE ( .NOT. domain_clockisstopsubtime(grid) ) 

! Synchronise simulation times between WRF & LES 
call gmcfSyncWrf 

! ... Actual model computations ... 

! Prepare the wind profile 

call gmcfCreateWindprofile(grid°/oU_2,grid°/oV_2,grid%w_2) 

! Send the data to the LES when requested 
call gmcfPostWrf 
END DO 
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call gmcfFinishedWrf 
! ... WRF-specific code . . . 

END SUBROUTINE integrate 
END MODULE inodule_integrate 

In LES, the time loop is in main, f 95: 

do n = nO,nmax 
time = float(n-l)*dt 
! Synchronise simulation times between WRF & LES 
call gmcfSyncLes 
! Request and receive the data from WRF 
call gmcfPreLes 

if (can_interpolate == 0) then 
can_interpolate = 1 

else 

! Interpolate the value for the current LES time step 
call gmcfInterpolateWindprofileLes(u,v,w) 

! ... Actual model computations 
end if ! interpolate guard 
end do 

call gmcfFinishedLes 

Here the code is a little bit more involved because LES only takes data from WRE every 
simulated minute but simulates with a resolution of half a second. The WRE data are inter¬ 
polated and the interpolation can only start after two time steps, hence the guard. 

The LES required other changes as well, because it used a fixed wind profile rather than 
taking data from WRE. However, these changes are not specific to model coupling so they 
are not included. 

4.1.3 Per-model API 

The changes to the model code are very small because the GMCE operation is abstracted 
into a per-model API. This API consists of a small number of subroutines: 

• The Init call initializes GMCE for the given model. 

• The Sync call synchronizes the model with the other models it’s communicating with. 

• The Pre and Post calls are taking care of the actual data exchange, Pre is before com¬ 
putation, Post is after computation. 

• The Finished call informs all other models that the given model has finished. 

These subroutines are currently written manually, the next step is generate the code based 
on annotations. 

• Eor Init and Sync, what is required is the time step information relative to some refer¬ 
ence, typically the model with the largest time step. 
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• For the Pre and Post calls, we need to know the names, types and sizes of the variables 
containing the data to be exchanged. 

• The Finished call can be generated without extra information 

There are also some API calls that are not as generic as the ones above: the data received 
from another model must be assigned to a variable in the given model, and often we only 
want to send a portion of a multidimensional array, so we may want to create intermediate 
variables. In the case of WRF and LES, we have gmcfInterpolateWindprofileLes(u,v,w) and 
gmcfCreateWmdprofileWrf(u,v,w) but actually the u,v,w are not quite identical in both mod¬ 
els. Currently, these subroutines have to be written by hand. In order to generate the code 
for these subroutines, we need to describe exactly how a variable from one model maps to 
a variable of another model. Our approach is to create an intermediate variable for each 
exchange and define the mapping to that variable for each model. 

4.2 Building and Evaluation 

Thanks to the extensive code generation and automation, building the GMCF executable 
is quite simple: first, specify the models to couple using a configuration file. This file is 
used to build the GMCF framework. Then build each model into a library. Then link the 
GMCF framework library and the model libraries into the top-level executable. Detailed 
instructions are available on GitHub. 

For the evaluation we tested the correctness of the time step synchronization and the data 
exchange. And actual evaluation of the performance requires the WRF input files used for 
creating the simulation that generates the wind profile for the LES, and this has not been 
done yet. 

However, we have successfully built and run a model coupling of WRF using OpenMP on 
a multicore GPU with the OpenCL LES running on the GPU. 


5 Conclusions 

The overall conclusions of this two-month pilot project are very encouraging: 

• We have demonstrated that our approach to model coupling, aimed at modern het¬ 
erogeneous manycore nodes, can be used to couple a complex NWP simulator such 
as WRF, parallelized using OpenMI’ with a custom LES simulator running on a GPU 
using OpenGL. 

• The LES is a very high-resolution simulator, therefore GPU acceleration was essential 
and we have shown that we can achieve a speed-up of seven times using OpenGL on 
the GPU. 

• Furthermore, our model coupling framework is designed to be useable for automatic 
generation from scenarios and specifications, and this will be the focus of our future 
work. Already, using GMGF requires only very minor modifications to the original 
model source code and build system. 
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• All software developed during this project has been open sourced and is available at 
https://github.com/wimvanderbauwhede. 

This work was conducted during a research visit at the Disaster Prevention Research Insti¬ 
tute of Kyoto University, supported by an EPSRC Overseas Travel Grant, EP/L026201/1. 
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